RED WINE ANALYSIS by Susana Chicano

Introduction

I chose the dataset Red Wine Quality because I really like red wine.

To build this dataset, at least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). The dataset is related to red variants of the Portuguese “Vinho Verde” wine. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Several of the attributes may be correlated, so it makes sense to apply some sort of feature selection.

Here is a list of the attributes:

1 - Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily).

2 - Volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.

3 - Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines.

4 - Residual sugar: the amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet.

5 - Chlorides: the amount of salt in the wine.

6 - Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

7 - Total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

8 - Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content.

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale.

10 - Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

11 - Alcohol: the percent alcohol content of the wine.

Loading the packages

The first step is to load all the necessary packages.

Reading the database

After loading the packages, I read the database with the read.csv function.
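
A minimal sketch of this step. The real file name is not given in this report, so the code writes a temporary stand-in CSV holding the first three rows shown in the summary below, restricted to three columns. It assumes the file was exported with row names under an empty header, which read.csv would name X.

```r
# Temporary stand-in for the real CSV, whose path is not shown in this report.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"","fixed.acidity","volatile.acidity","quality"',
  '"1",7.4,0.7,5',
  '"2",7.8,0.88,5',
  '"3",7.8,0.76,5'
), csv_path)

# read.csv() assigns the name X to a column whose header is empty.
df <- read.csv(csv_path)
str(df)
```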

Summary

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

This tidy data set contains 1,599 red wines with 13 variables, most of them chemical properties of the wine. Most of the columns (variables) are of the “numerical” data type, except for X and quality, which are “integers”. Let’s remove the X column, since it looks like the index column.

After deleting the X column, we have 12 variables.
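
One way to drop the column, sketched on the first three rows from the summary above (restricted to three columns for brevity):

```r
# First three rows from the summary above, restricted to three columns.
df <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  quality = c(5L, 5L, 5L)
)

# Assigning NULL removes the column in place.
df$X <- NULL
names(df)
```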

The main feature of the dataset is the quality of the wine. Other interesting figures in the dataset are the alcohol level, the sugar level and also the acidity. Those will help me better understand why a wine is considered a good quality wine.

Univariate Analysis

We analyzed every variable individually using histograms. The independent variable is on the x axis, and the dependent variable (count) is on the y axis.

Chlorides, free sulfur dioxide, sulphates, alcohol, residual sugar, citric acid and total sulfur dioxide are skewed to the right, with most of the occurrences concentrated on the left side of the chart (the lower levels of each variable).

Fixed acidity, pH, volatile acidity, density and quality resemble normal distributions, with most of the occurrences in the middle values.

To better understand some of the plots, I used scale_x_log10 and tweaked the bin size. Now I have a clearer view, with distributions that look closer to normal. I only transform the variables that are not normally distributed.

In the following charts I am using boxplots to visualize the outliers.
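
A sketch of one such histogram, assuming ggplot2 is the plotting package. It uses only the first ten alcohol values from the summary above, so the shape is illustrative and not the full distribution. The bin width is an arbitrary choice.

```r
library(ggplot2)

# First ten alcohol values from the summary above (not the full data set).
df <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# The variable goes on the x axis; geom_histogram() computes the count for y.
p_hist <- ggplot(df, aes(x = alcohol)) +
  geom_histogram(binwidth = 0.2) +
  labs(x = "Alcohol (%)", y = "Count")
print(p_hist)
```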
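
A sketch of the log transformation on the first ten residual sugar values from the summary above. The bin width is an assumed value; with scale_x_log10 it is measured in log10 units, so it needs to be much smaller than on the raw scale.

```r
library(ggplot2)

# First ten residual sugar values from the summary above.
df <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
)

# scale_x_log10() plots the x axis on a base-10 log scale;
# binwidth is then measured in log10 units.
p_log <- ggplot(df, aes(x = residual.sugar)) +
  geom_histogram(binwidth = 0.05) +
  scale_x_log10() +
  labs(x = "Residual sugar (g/L, log10 scale)", y = "Count")
print(p_log)
```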
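
The points a boxplot draws as outliers can also be listed directly. This sketch uses base R's boxplot.stats on the first ten residual sugar values from the summary above; it illustrates the idea and is not necessarily the code behind the charts.

```r
# First ten residual sugar values from the summary above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# boxplot.stats() returns the five-number summary ($stats) and the points
# lying more than 1.5 times the box length beyond the hinges ($out).
stats <- boxplot.stats(residual_sugar)
stats$out
```

In this ten-value sample, 6.1 is the only value flagged.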

Statistics

Let’s print a summary of each variable.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The summary provides us with statistical information for every variable in the data set: Min. value, 1st Qrtl., Median, Mean, 3rd Qrtl. and Max. value.

I will also analyze the frequency of values for each variable.

The most common fixed acidity value is 7.2, and 0.08 is the most common value for chlorides. The most common pH is 3.3. For residual sugar, 2 is the most typical value. The most frequent alcohol value is 9.5. If we look at quality, we can see that most of the wines analyzed are rated 5 or 6. I also looked at the number of wines with a citric acid value of 0, and it was 132, which was surprisingly high.
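
A sketch of how such frequencies can be obtained with table, using the first ten quality and citric acid values from the summary above. The numbers it produces describe that small sample only, not the full data set.

```r
# First ten values of each column from the summary above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# table() counts how often each distinct value occurs.
quality_counts <- table(quality)
quality_counts

# The most frequent value is the name of the largest count.
most_common_quality <- names(which.max(quality_counts))

# Number of wines with a citric acid value of exactly 0.
zero_citric <- sum(citric_acid == 0)
```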

Let’s take a closer look at the “quality” variable. I am printing a summary and a frequency table.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The minimum quality is 3, the maximum is 8 and the mean is 5.636. The most common quality ratings among the red wines studied are 5 and 6.

We are going to compare the quality of the wine based on different variables. For that purpose, and to better visualize the results, I will create three quality categories: Poor (3, 4), Average (5, 6) and Best (7, 8). I create a new column called Rating, which holds these categories.

##    Poor Average    Best 
##      63    1319     217

We see that average wines are the most common.
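
One way to build such a column is cut. This sketch rebuilds the quality values from the frequency table above and bins them; the break points are my reading of the three categories, not necessarily the original code.

```r
# Rebuild the quality column from the frequency table above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# cut() uses right-closed intervals: (2,4], (4,6], (6,8].
rating <- cut(quality, breaks = c(2, 4, 6, 8),
              labels = c("Poor", "Average", "Best"))
table(rating)
```

The resulting counts match the table above: 63 Poor, 1319 Average and 217 Best.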

Bivariate Analysis

Let’s run a bivariate analysis of each variable within each rating category.

The previous charts don’t really provide us with very good visualizations. Let’s try with boxplots.

Now we can better assess how each of the variables affects the rating of the wines.

The best wines have higher levels of fixed acidity, sulphates, citric acid and alcohol. On the other hand, they have lower levels of pH and volatile acidity. The levels of chlorides and sulfur, and the density, don’t really affect the quality.

Another way to compare the metrics for each category is to use the summary feature. Here are a couple of examples.

## Rating and alcohol level:
## df$Rating: Poor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.22   11.00   13.10 
## -------------------------------------------------------- 
## df$Rating: Average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90 
## -------------------------------------------------------- 
## df$Rating: Best
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00
## Rating and fixed acidity level:
## df$Rating: Poor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.871   8.400  12.500 
## -------------------------------------------------------- 
## df$Rating: Average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.100   7.800   8.254   9.100  15.900 
## -------------------------------------------------------- 
## df$Rating: Best
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.700   8.847  10.100  15.600
## Rating and citric acid level:
## df$Rating: Poor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0200  0.0800  0.1737  0.2700  1.0000 
## -------------------------------------------------------- 
## df$Rating: Average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2400  0.2583  0.4000  0.7900 
## -------------------------------------------------------- 
## df$Rating: Best
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3000  0.4000  0.3765  0.4900  0.7600

Comparing the mean of each variable across the rating categories gives the same results we saw in the plots.
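
Grouped summaries in this layout can be produced with by. A minimal sketch on the first ten rows from the summary near the top of this report; the Rating column is rebuilt with cut, which is an assumption about how it was created.

```r
# First ten alcohol and quality values from the summary near the top.
df <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
df$Rating <- cut(df$quality, breaks = c(2, 4, 6, 8),
                 labels = c("Poor", "Average", "Best"))
# This small sample has no Poor wines, so drop the empty level.
df$Rating <- droplevels(df$Rating)

# by() applies summary() to the alcohol values within each rating.
by(df$alcohol, df$Rating, summary)

# tapply() gives just the group means for a quick comparison.
mean_alcohol <- tapply(df$alcohol, df$Rating, mean)
```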

Multivariate analysis

Here I am going to use scatterplots to visualize three variables at a time. It is interesting to see that the best wines have lower pH and volatile acidity. Also, the best wines have a higher alcohol content, but their density is all over the spectrum. Higher sulphate levels and low volatile acidity characterize the best wines as well.

I want to run a simple correlation analysis between residual sugar and alcohol. Looking at the plot below, it is surprising to find no (negative) correlation between those two variables. Residual sugar measures the amount of sugar left after fermentation. Fermentation converts sugar into alcohol, so I expected that less residual sugar would mean more alcohol, and vice versa.

I am taking it a step further and calculating the correlation between all the variables. I will do that using a correlation matrix graph.
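
A sketch of a three-variable scatterplot, assuming ggplot2: two variables on the axes and the rating mapped to color. It uses the first ten rows from the summary near the top of this report, so it contains no Poor wines and shows the technique only, not the pattern in the full data.

```r
library(ggplot2)

# First ten pH, volatile acidity and quality values from the summary.
df <- data.frame(
  pH = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
df$Rating <- cut(df$quality, breaks = c(2, 4, 6, 8),
                 labels = c("Poor", "Average", "Best"))

# Two variables on the axes, the third mapped to point color.
p_scatter <- ggplot(df, aes(x = pH, y = volatile.acidity, color = Rating)) +
  geom_point(alpha = 0.7) +
  labs(x = "pH", y = "Volatile acidity")
print(p_scatter)
```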
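
Besides the plot, the relationship can be quantified with cor.test. This sketch runs it on the first ten values from the summary near the top of this report; ten rows are far too few to draw conclusions, so it only shows the call.

```r
# First ten residual sugar and alcohol values from the summary.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Pearson's r with a test of the null hypothesis that r is 0.
result <- cor.test(residual_sugar, alcohol, method = "pearson")
result$estimate
result$p.value
```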

##                     fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.00            -0.26        0.67
## volatile.acidity            -0.26             1.00       -0.55
## citric.acid                  0.67            -0.55        1.00
## residual.sugar               0.11             0.00        0.14
## chlorides                    0.09             0.06        0.20
## free.sulfur.dioxide         -0.15            -0.01       -0.06
##                     residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.11      0.09               -0.15
## volatile.acidity              0.00      0.06               -0.01
## citric.acid                   0.14      0.20               -0.06
## residual.sugar                1.00      0.06                0.19
## chlorides                     0.06      1.00                0.01
## free.sulfur.dioxide           0.19      0.01                1.00
##                     total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                      -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                    0.08    0.02  0.23     -0.26   -0.20
## citric.acid                         0.04    0.36 -0.54      0.31    0.11
## residual.sugar                      0.20    0.36 -0.09      0.01    0.04
## chlorides                           0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                 0.67   -0.02  0.07      0.05   -0.07
##                     quality
## fixed.acidity          0.12
## volatile.acidity      -0.39
## citric.acid            0.23
## residual.sugar         0.01
## chlorides             -0.13
## free.sulfur.dioxide   -0.05
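
A table like the one above can be produced with cor and round. A minimal sketch on three columns and the first ten rows from the summary near the top of this report, so its values will differ from the full-data table:

```r
# First ten values of three columns from the summary.
df <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# cor() returns the pairwise Pearson correlations; round for readability.
cor_matrix <- round(cor(df), 2)
cor_matrix
```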

This visualization yields some very interesting findings, and many of them make total sense.

Fixed acidity is positively correlated with the amount of citric acid in the wine and with its density, and is negatively correlated with pH. Volatile acidity is negatively correlated with citric acid content. Citric acid also correlates negatively with pH. We also see that the higher the alcohol level, the lower the density. Residual sugar doesn’t seem to be strongly correlated with any other factor. But we see that alcohol is positively correlated with quality. In other words: the higher the alcohol level, the higher the quality. This correlation is not that strong, though.

Final Plots and Summary

Plot One - Quality histogram

Description One

This plot is a frequency histogram. It shows how many wines were rated quality 3, 4, 5, 6, 7 and 8 (3 being the lowest quality and 8 being the highest). Most of the wines studied were rated in the middle range (5 and 6). The chart resembles a normal distribution, denser in the center and less dense in the tails.

Plot Two - Boxplot analysis of individual variables by Rating

Description Two

These are actually several plots, boxplots in particular. They show us the median, the min value, the max value, and the IQR (interquartile range). They also let us visualize the outliers.

Here we can see very clearly how each variable affects the wine ratings. For instance, we observe that the wines perceived as best have a higher level of citric acid, and that the lower the volatile acidity (which gives wine the unpleasant vinegar taste), the better the wine.

Plot Three - Correlation between volatile acidity and pH

Description Three

This is a correlation scatterplot, which also takes into consideration a third variable. In this case, we are looking at the correlation between ‘pH’ and ‘volatile acidity’, color-coding the third variable ‘rating’. We see that the quality of the wine decreases as the pH and the volatile acidity increase.

Reflection

I found this wine analysis quite interesting. I did not face too many challenges and really enjoyed digging into the data. It was surprising that residual sugar does not really affect how the wines are perceived, and not so surprising that the wines perceived as best contain more alcohol.

The fact that the database is not balanced might have had an impact on the analysis. There are many more normal wines than excellent or poor ones.

In future research, I would love to be able to include the following variables: brand, region, and price. It would be great to analyze whether the perception of the best wines has any correlation with price.